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Abstract 

We propose a novel variance component approach for the analysis of next-generation sequencing data. Our 
method is based on the detection of the proportion of the trait phenotypic variance that can be explained by the 
introduction of a new variance component that accounts for the local gene-specific departure of the empirical 
kinship relationship matrix, estimated from single-nucleotide polymorphism (SNP) genotypes, from their theoretical 
expectation based on the genealogical information in the pedigree. We tested our method with simulated 
phenotypes and imputed SNP genotypes from the Genetic Analysis Workshop 18 data set. We observed 
considerable variation in the differences between theoretical and gene-specific kinship estimates that proved to be 
informative for our test and allowed us to detect the MAP4 causal gene at a genome-wide significance level. The 
distribution of our test statistic show no inflation under the null hypothesis and results from a random set of genes 
suggest that the detection of MAP4 is both sensitive and specific. The use of 2 different strategies for the selection 
of the SNPs used to derive the gene-specific empirical kinship relationship matrices provides us with suggestive 
evidence that our method is performing as an empirical test of linkage. 



Background 

Complex phenotypes are thought to be determined by 
the aggregate effects of many rare causal variations 
[1-3]. Detection of the true causal variations present in 
next- generation sequencing data sets [4,5] is challenging 
because their faint signals are difficult to separate from 
background noise. Most of the current analytical meth- 
ods try to improve the signal-to-noise ratio by reducing 
the number of statistical tests needed for a significant 
signal to be detected. 

A common approach to alleviate the multiple-testing 
problem is to collapse, commonly by membership of a 
variant in a known annotated gene or pathway, the infor- 
mation conveyed by individual variants into a single mea- 
sure, like a principal component or a weighted rank, that 
can then be tested [6]. However, a common limitation of 
many approaches is that the aggregation of the variants 
into a single measure often involves an arbitrary defini- 
tion of the directionality of each variant's fixed effects. 



We present a novel random-effect-variance compo- 
nent-based approach that uses gene-specific relationship 
matrices to collapse variants into a per-gene genetic 
contribution effect. 

Methods 

Data set 

The Genetic Analysis Workshop 18 (GAW18) data [7], 
based on whole genome sequencing data for the odd-num- 
bered chromosomes of 464 individuals released by the 
T2D-GENES Consortium, was used to test our method. 
Specifically, we used pedigrees, minor allele-based single- 
nucleotide polymorphism (SNP) dosages, and the SIM- 
PHEN.l simulated phenotypes in the GAW18 data set. 

Definition of the gene loci 

The transcription start site and the stop codon coordi- 
nates for the longest transcript associated with a gene 
were obtained from the UCSC's human genome release 
19 (hgl9) known gene table. 
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Gene-specific SNP dosages 

To investigate if the procedure used to select the SNPs 
that were collected on a per-gene locus basis affected 
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our test results, we used 2 different SNP selection 
approaches: the intragenic and the nonsyn strategies. 
The intragenic strategy consisted of the selection of all 
SNPs within the bounds of a gene. The nonsyn strategy 
consisted of the selection of the subset of intragenic 
SNPs that were annotated as being nonsynonymous 
coding changes using ANNOVAR [8]. GAW18 SNP 
dosages from the imputed genotypes where then col- 
lected into separate, gene-specific, dosage files for SNPs 
selected using the intragenic and nonsyn strategies. 

Gene-specific empirical kinship matrices 

Gene-specific dosages were transformed into genotypes 
and processed with KING [9], a method for relationship 
inference from large SNP genotype data sets that is 
robust to population substructure, to produce a gene- 
specific matrix of empirical kinship coefficients. 

Control for unknown population substructure 

To control for possible population stratification, principal 
component loadings were calculated using the prcomp 
function in R [10], with data from 117 unrelated indivi- 
duals for approximately 29,000 haplotype tagging SNPs 
in low mutual linkage disequilibrium, and then projected 
onto the full set of genotyped individuals. The first 5 
principal components explained 5% of the total phenoty- 
pic variance and were added as covariates to our variance 
component model. 

Trait and covariates 

We used the simulated phenotypic data at the first exam 
for the systolic blood pressure (SBP l) trait. The sex 
(SEX), age (AGE_1), and smoke (SMOKE_l) status at the 
first exam phenotypes were introduced as covariates into 
our variance component model. The Ql trait was used to 
assess the distribution of our test statistic under the null 
hypothesis. 

Variance component model 

Our method uses gene-specific relationship matrices 
(GSRMs) to extract the proportion of the trait's variance 
explained by a single gene as a result of the departure of 
its localized empirical kinship estimates (EKEs) from 
their pedigree-derived theoretical kinship expectations 
(TKEs). A new variance component parameter (h^^.^) 
was introduced into a standard variance component 
model 

where S2 is the covariance matrix, crphemtypic '^he total 
phenotypic variance; h^, h^eff' ^'^'^ e^' respectively, repre- 
sent the proportion of Op^^^^^^^^ that can be attributed to 



the residual additive effect of polygenes, a gene-specific 
effect; and a random environmental effect, <t>, is the TKE 
kinship matrix, £ is the EKE kinship matrix, and / is the 
identity matrix. This partitioning of the trait variance 
was estimated using an extension of the polygenic com- 
mand from SOLAR [11] independently for each gene. 
The significance of each h^gjy estimate was obtained 
from a likelihood ratio test against the null model 

Because the variance component /i^^^ is tested on its 
boundary, the likelihood ratio test statistic is distributed 
as a Vr}/! mixture of a 1 degree of freedom (DF) chi- 
square and a point mass at zero [12]. 

Results 

We compared the observed gene-specific EKE values 
obtained from the imputed SNP dosages with the TKE 
values derived from the pedigree and found substantial 
differences between them (Figure 1). The negative skew 
in Figure 1 shows that gene-specific EKE values are lar- 
ger than their TKE counterparts and it shows that for 
certain genes individuals appear to be more closely 
related than expected from their relatedness in the 
pedigree. 

We then performed variance component analyses 
using GSRMs with intragenic and nonsyn EKE values 
for 12 of the causal SBP l genes in the simulated data 
set (Table 1) and a random gene sample (Table 2). We 
detected a clear and significant signal from the MAP4 
causal gene using both the intragenic and nonsyn strate- 
gies, that reached genome-wide significance (after a con- 
servative Bonferroni correction for 30,000 tests, p <1.6 x 
10 in the nonsyn (Table 1). The magnitude of the 
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Figure 1 Distribution of the gene-specific differences between 
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Table 1 Estimated effects on the simulated SBP_1 trait for known causal genes 
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geff. Gene-specific effect estimate (hgeff)' Q^-P: significance of the gene-specific effect estimate; h2r, trait heritability estimate (h^); h2r_p, significance of the 
trait heritability estimate. 



MAP4 signal is strong enough for it to be specifically 
detected as the top result in a random sample of 100 
genes (Table 2). Other causal genes also rank among 
the top results, but their signals are weaker (Table 2). 
Figure 2 suggests that our approach has the sensitivity 
to separate true-positive signals from false-positive ones, 
as there is no inflation or deflation of the p values that 
we obtained for the estimates of the gene effects evalu- 
ated under the null hypothesis. 

Discussion 

We performed variance component analyses using a 
novel approach to estimate the proportion of the trait 
phenotypic variance that can be attributed to a single 
gene. We first collapsed the genotypes from SNP variants 



into a GSRM that more closely approximates the correla- 
tions between related individuals at a gene-specific level. 
Figure 1 shows that there is substantial variation among 
genes in terms of the differences between TKE and gene- 
specific EKE values that had the potential to explain part 
of the trait variance. Thus, we then obtained gene-speci- 
fic estimates of the h^^g parameter and its significance 
from SOLAR, using the empirical GSRM. 

Our results showed that the gene with the highest 
effect on the simulated SBP l trait was detected at a 
significance level that surpasses a conservative multiple 
testing threshold for the p values. Figure 2 shows that 
our test statistic was not inflated when evaluated under 
the null hypothesis using the Ql trait and a random 
sample of genes. MAP4 was also consistently detected 



Table 2 Top 10 most significant results for genes in a combined sample of 100 random and 12 causal genes 
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geff, Gene-specific effect estimate (hggff)i geff J, significance of the gene-specific effect estimate; h2r, trait heritability estimate (h^); hlrjp, significance of the 
trait heritability estimate. 

•Known causal gene for SBP_1 in the simulated data set. 
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using the intragenic and nonsyn strategies (see Figures 1 
and 2), with other causal genes ranking within our first 
top 10 results. This seems to suggest that our test is 
sensitive and specific enough for the detection of true- 
positive signals without enrichment of false-positive 
ones. 

As a consequence of using a different strategy to 
select the SNPs for the estimation of the empirical 
GSRM, our results for MAP4 improved. MAP4 results 
were an order of magnitude less significant for the intra- 
genic than for the nonsyn strategy. We believe that this 
is the result of rare functional alleles driving the EKE of 
the GSRM matrices for the nonsyn strategy without the 
noise introduced by shared noncoding alleles. In effect, 
the nonsyn GSRM matrices better approximate the 
gene's probability of identity-by-descent sharing, thus 
making our test a gene-specific empirical test of linkage 
that is also robust to the heterogeneity of the causal 
variants. 

Finally, we want to note that our method is not 
restricted either to a particular measure of genetic iden- 
tity or to its estimation on a gene-specific basis; iden- 
tity-by-state and genomic regions, even if they are 
nonsyntenic [13], can potentially be used instead. 

Conclusions 

We were able to obtain encouraging, proof-of-concept 
results from the application of our method to GAW18 
data. We observed differences between the TKEs and 
their gene-specific empirical estimations. We obtained 
genome-wide significant results on the SBP l simulated 
trait for MAP4 that seem to indicate that our test is 
both specific and sensitive enough, and which also 
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Figure 2 Q-Q plot of the p values for the gene-specific effect 
estimates evaluated under the null hypothesis. The p values for 
the gene-specific effect estimates were calculated using SNPs 
selected with the intragenic strategy for a random sample of 5000 
genes, using the Ql trait, a trait highly heritable but not influenced 
by any of the GAW18 SNPs. 



suggest that our method is behaving as a gene-specific 
empirical test of linkage. 
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